One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Philipp Koehn, Tony Robinson
(Submitted on 11 Dec 2013 (v1), last revised 4 Mar 2014 (this version, v3))
We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline.
The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
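The two reductions quoted in the abstract describe the same improvement on different scales, since per-word cross-entropy in bits is the base-2 logarithm of perplexity. The minimal sketch below checks that arithmetic; only the 67.6 baseline and the 35% figure come from the abstract, and the function and variable names are illustrative.

```python
import math

def cross_entropy_bits(perplexity):
    """Per-word cross-entropy in bits for a given perplexity."""
    return math.log2(perplexity)

baseline_ppl = 67.6                       # unpruned Kneser-Ney 5-gram (from the abstract)
combined_ppl = baseline_ppl * (1 - 0.35)  # 35% lower perplexity (from the abstract)

baseline_bits = cross_entropy_bits(baseline_ppl)
combined_bits = cross_entropy_bits(combined_ppl)
relative_drop = 1 - combined_bits / baseline_bits

print(f"{baseline_bits:.2f} bits -> {combined_bits:.2f} bits "
      f"({relative_drop:.1%} lower)")
```

The relative drop in bits comes out at roughly 10%, consistent with the abstract.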
Comments: Accompanied by a code.google.com project allowing anyone to generate the benchmark data, and use it to compare their language model against the ones described in the paper
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:1312.3005 [cs.CL]
(or arXiv:1312.3005v3 [cs.CL] for this version)
Language Model on One Billion Word Benchmark
Authors:
Oriol Vinyals (vinyals@google.com, github: OriolVinyals),
Xin Pan
Paper Authors:
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu
TL;DR
This is a model pretrained on the One Billion Word Benchmark.
If you use this model in your publication, please cite the original paper:
@article{jozefowicz2016exploring,
title={Exploring the Limits of Language Modeling},
author={Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike
and Shazeer, Noam and Wu, Yonghui},
journal={arXiv preprint arXiv:1602.02410},
year={2016}
}
Introduction
In this release, we open source a model trained on the One Billion Word
Benchmark (http://arxiv.org/abs/1312.3005), a large language corpus in English
which was released in 2013. This dataset contains about one billion words, and
has a vocabulary size of about 800K words. It contains mostly news data. Since
sentences in the training set are shuffled, models can ignore cross-sentence
context and focus on sentence-level language modeling.
In the original release and subsequent work, people have used the same test set
to evaluate models trained on this dataset, making it a standard benchmark for
language modeling.
Recently, we wrote an article (http://arxiv.org/abs/1602.02410) describing a
hybrid model that combines a character CNN, a large and deep LSTM, and a specific
Softmax architecture. This allowed us to train the best model on this dataset
thus far, almost halving the best perplexity previously obtained by others.
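As a rough illustration of the character CNN idea only, the sketch below embeds a word by sliding convolution filters over its character vectors and keeping each filter's maximum response. The function name, the toy character vectors, the filter shapes, and the zero-padding are all assumptions made for this example; they are not the released model's layers or sizes.

```python
def char_cnn_word_embedding(word, char_emb, filters):
    """Embed `word` with a 1-D convolution over characters plus max-pooling.

    char_emb: dict mapping a character to a list of d floats (hypothetical).
    filters:  list of filters; each filter is a list of `width` rows,
              and each row is a list of d floats (hypothetical).
    Returns one feature per filter.
    """
    d = len(next(iter(char_emb.values())))
    zero = [0.0] * d
    max_width = max(len(f) for f in filters)

    # Look up character vectors; unknown characters map to zeros.
    chars = [char_emb.get(c, zero) for c in word]
    # Right-pad so that even the widest filter fits at least once.
    chars += [zero] * max(0, max_width - len(chars))

    features = []
    for filt in filters:
        width = len(filt)
        best = float("-inf")
        for start in range(len(chars) - width + 1):
            window = chars[start:start + width]
            # Dot product between the filter and one window of characters.
            act = sum(w * x
                      for row, vec in zip(filt, window)
                      for w, x in zip(row, vec))
            best = max(best, act)
        features.append(best)  # max-over-time pooling
    return features


# Toy setup: 2-dimensional character vectors and two filters (widths 1 and 2).
toy_emb = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
toy_filters = [
    [[1.0, 0.0]],              # responds to "a"
    [[0.0, 1.0], [1.0, 0.0]],  # responds to "b" followed by "a"
]
print(char_cnn_word_embedding("ba", toy_emb, toy_filters))
```

Because the embedding is built from characters, a word outside the vocabulary still receives a vector, which is one common motivation for character-level inputs.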
Code Release
The open-sourced components include:
- TensorFlow GraphDef proto buffer text file.
- TensorFlow pre-trained checkpoint shards.
- Code used to evaluate the pre-trained model.
- Vocabulary file.
- Test set from LM-1B evaluation.
The code supports 4 evaluation modes:
- Given a provided dataset, calculate the model’s perplexity.
- Given a prefix sentence, predict the next words.
- Dump the softmax embedding and the character-level CNN word embeddings.
- Given a sentence, dump the embedding from the LSTM state.
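For the perplexity mode, the quantity being reported can be sketched as the exponential of the average negative log-probability per word. This is a minimal sketch assuming natural-log probabilities; the function name and the example probabilities are hypothetical, and the released evaluation code may accumulate the value differently.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities."""
    if not log_probs:
        raise ValueError("need at least one word")
    avg_neg_log_prob = -sum(log_probs) / len(log_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities a model assigned to four consecutive test words.
example_log_probs = [math.log(p) for p in (0.25, 0.5, 0.125, 0.25)]
print(perplexity(example_log_probs))
```

The geometric mean of these four probabilities is 0.25, so the perplexity is about 4: on average the model is as uncertain as if it were choosing uniformly among four words.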
Results
Model | Test Perplexity | Number of Params [billions] |
---|---|---|
Sigmoid-RNN-2048 [Blackout] | 68.3 | 4.1 |
Interpolated KN 5-gram, 1.1B n-grams [chelba2013one] | 67.6 | 1.76 |
Sparse Non-Negative Matrix LM [shazeer2015sparse] | 52.9 | 33 |
RNN-1024 + MaxEnt 9-gram features [chelba2013one] | 51.3 | 20 |
LSTM-512-512 | 54.1 | 0.82 |
LSTM-1024-512 | 48.2 | 0.82 |
LSTM-2048-512 | 43.7 | 0.83 |
LSTM-8192-2048 (No Dropout) | 37.9 | 3.3 |
LSTM-8192-2048 (50% Dropout) | 32.2 | 3.3 |
2-Layer LSTM-8192-1024 (BIG LSTM) | 30.6 | 1.8 |
(THIS RELEASE) BIG LSTM+CNN Inputs | 30.0 | 1.04 |
How To Run
Prerequisites:
- Install TensorFlow.
- Install Bazel.
- Download the data files:
- It is recommended to run on a modern desktop instead of a laptop.
1. Clone the code to your workspace.